KAFKA-17182: Consumer fetch sessions are evicted too quickly with AsyncKafkaConsumer #17700
Conversation
… the new consumer
Updated the FetchRequestManager to only create and enqueue fetch requests when signaled to do so by a FetchEvent.
…om prepareFetchRequests()
Fixed typo
@kirktrue: Thanks for the updated PR. The code LGTM. Are the test failures related?
I don't believe they are, no. I'll look at the failures from the current test run and dig around a little to see if others are hitting them too, then report back. Thanks.
@junrao, the majority of the errors I see in the latest test run are not related. The following test failure occurs on both Java 17 and 23, but the issue has already been filed several times:
The following tests are flaky and have issues filed:
The only issue that isn't filed is this:
I'll see if I can reproduce that flaky test locally.
@junrao, I wasn't able to reproduce the flaky behavior in that test.
@junrao, the latest test run has a few flaky tests, but they're all known flaky tests that are filed in Jira. Are we able to merge this change, or should we wait for a green build? Thanks!
@kirktrue, the fix for the failed test here has been merged to trunk and builds are green again. Get the latest changes and we should be green here too.
@junrao @lianetm @jeffkbkim, all green! Can we merge? 🥺
@kirktrue: Thanks for triaging the tests. LGTM
🥳
LGTM too, thanks @kirktrue!
@kirktrue: Do we want to create a separate PR to cherry-pick this to 4.0?
Yes. Is that step performed by the merger or the contributor? Sometimes the person merging to … Thanks!
…ncKafkaConsumer (#17700)

This change reduces fetch session cache evictions on the broker for AsyncKafkaConsumer by altering its logic to determine which partitions it includes in fetch requests.

Background

Consumer implementations fetch data from the cluster and temporarily buffer it in memory until the user next calls Consumer.poll(). When a fetch request is being generated, partitions that already have buffered data are not included in the fetch request.

The ClassicKafkaConsumer performs much of its fetch logic and network I/O in the application thread. On poll(), if there is any locally-buffered data, the ClassicKafkaConsumer does not fetch any new data and simply returns the buffered data to the user from poll().

On the other hand, the AsyncKafkaConsumer splits its logic and network I/O between two threads, which results in a potential race condition during fetch. The AsyncKafkaConsumer also checks for buffered data on its application thread. If it finds there is none, it signals the background thread to create a fetch request. However, it's possible for the background thread to receive data from a previous fetch and buffer it before the fetch request logic starts. When that occurs, as the background thread creates a new fetch request, it skips any buffered data, which has the unintended result that those partitions are added to the fetch request's "to remove" set. This signals the broker to remove those partitions from its internal cache.

This issue is technically possible in the ClassicKafkaConsumer too, since the heartbeat thread performs network I/O in addition to the application thread. However, because of the frequency at which the AsyncKafkaConsumer's background thread runs, it is ~100x more likely to happen.

Options

The core decision is: what should the background thread do if it is asked to create a fetch request and it discovers there's buffered data? There were multiple proposals to address this issue in the AsyncKafkaConsumer. Among them are:

1. The background thread should omit buffered partitions from the fetch request, as before (the existing behavior)
2. The background thread should skip fetch request generation entirely if there are any buffered partitions
3. The background thread should include buffered partitions in the fetch request, but use a small "max bytes" value
4. The background thread should skip fetching from the nodes that have buffered partitions

Option 4 won out. The change is localized to AbstractFetch, where the basic idea is to skip fetch requests to a given node if that node is the leader for buffered data. By preventing a fetch request from being sent to that node, it won't have any "holes" where the buffered partitions should be.

Reviewers: Lianet Magrans <[email protected]>, Jeff Kim <[email protected]>, Jun Rao <[email protected]>
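For background on the "to remove" set mentioned above: with incremental fetch sessions (KIP-227), a partition that was part of the session but is absent from the next request is treated as forgotten, and the broker drops it from its session cache. Below is a simplified model of that diff using hypothetical names (the real client-side bookkeeping lives in `FetchSessionHandler`):

```java
import java.util.HashSet;
import java.util.Set;

import org.apache.kafka.common.TopicPartition;

// Simplified model of an incremental fetch session diff: partitions present in
// the session but missing from the next request are marked for removal.
public class IncrementalFetchDiff {

    public record Diff(Set<TopicPartition> added, Set<TopicPartition> removed) { }

    public static Diff diff(Set<TopicPartition> sessionPartitions,
                            Set<TopicPartition> nextRequestPartitions) {
        Set<TopicPartition> added = new HashSet<>(nextRequestPartitions);
        added.removeAll(sessionPartitions);

        // Buffered partitions that were skipped while building the next request
        // end up here, even though the consumer still wants their data.
        Set<TopicPartition> removed = new HashSet<>(sessionPartitions);
        removed.removeAll(nextRequestPartitions);

        return new Diff(added, removed);
    }
}
```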
Thanks for pushing this through, @kirktrue!
Hey folks, I'm tracing the failure of https://develocity.apache.org/scans/tests?search.rootProjectNames=kafka&search.timeZoneId=America%2FLos_Angeles&tests.container=org.apache.kafka.streams.integration.StandbyTaskEOSMultiRebalanceIntegrationTest&tests.test=shouldHonorEOSWhenUsingCachingAndStandbyReplicas() back to this PR. Can we look into this? It failed only for this PR on Jan 27, 28, and 29, and then, once it was merged, we started seeing the failure a lot on trunk.
…with AsyncKafkaConsumer (apache#17700)" This reverts commit 9c02072.
This change reduces fetch session cache evictions on the broker for `AsyncKafkaConsumer` by altering its logic to determine which partitions it includes in fetch requests.

Background

`Consumer` implementations fetch data from the cluster and temporarily buffer it in memory until the user next calls `Consumer.poll()`. When a fetch request is being generated, partitions that already have buffered data are not included in the fetch request.

The `ClassicKafkaConsumer` performs much of its fetch logic and network I/O in the application thread. On `poll()`, if there is any locally-buffered data, the `ClassicKafkaConsumer` does not fetch any new data and simply returns the buffered data to the user from `poll()`.

On the other hand, the `AsyncKafkaConsumer` splits its logic and network I/O between two threads, which results in a potential race condition during fetch. The `AsyncKafkaConsumer` also checks for buffered data on its application thread. If it finds there is none, it signals the background thread to create a fetch request. However, it's possible for the background thread to receive data from a previous fetch and buffer it before the fetch request logic starts. When that occurs, as the background thread creates a new fetch request, it skips any buffered data, which has the unintended result that those partitions are added to the fetch request's "to remove" set. This signals the broker to remove those partitions from its internal cache.

This issue is technically possible in the `ClassicKafkaConsumer` too, since the heartbeat thread performs network I/O in addition to the application thread. However, because of the frequency at which the `AsyncKafkaConsumer`'s background thread runs, it is ~100x more likely to happen.
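To make the timing window concrete, here is a toy two-thread model of the race. This is not consumer code, just an illustration of the interleaving; all names are hypothetical:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Toy model of the race between the application thread and the background
// thread: the application thread sees an empty buffer and signals for a fetch,
// but data from a *previous* fetch arrives and is buffered before the fetch
// request is actually built.
public class FetchRaceDemo {

    private static final BlockingQueue<String> buffer = new LinkedBlockingQueue<>();

    public static void main(String[] args) throws InterruptedException {
        // Application thread: the buffer is empty, so signal the background thread.
        boolean signalFetch = buffer.isEmpty();

        // Background thread: a response from an in-flight fetch lands now,
        // after the check but before the fetch request is generated.
        Thread backgroundIo = new Thread(() -> buffer.add("records for partition topic-0"));
        backgroundIo.start();
        backgroundIo.join();

        if (signalFetch) {
            // The fetch request is built here. "topic-0" now has buffered data,
            // so it is skipped, and an incremental fetch session interprets the
            // omission as "remove topic-0 from the session."
            System.out.println("Building fetch request; skipping buffered: " + buffer);
        }
    }
}
```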
Options

The core decision was: what should the background thread do if it is asked to create a fetch request and it discovers there's buffered data? There were multiple proposals to address this issue in the `AsyncKafkaConsumer`. Among them:

1. The background thread omits buffered partitions from the fetch request, as before (the existing behavior)
2. The background thread skips fetch request generation entirely if there are any buffered partitions
3. The background thread includes buffered partitions in the fetch request, but uses a small "max bytes" value
4. The background thread skips fetching from the nodes that have buffered partitions

Option 4 won out. The change is localized to `AbstractFetch`, where the basic idea is to skip fetch requests to a given node if that node is the leader for buffered data. By preventing a fetch request from being sent to that node, it won't have any "holes" where the buffered partitions should be. A minimal sketch of this idea follows.
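In this hypothetical sketch, `FetchNodeFilter`, `LeaderLookup`, and `canFetchFromNode` are made-up names standing in for the real `AbstractFetch` internals, not Kafka APIs. A node is excluded from fetching while it leads any partition with buffered data:

```java
import java.util.Optional;
import java.util.Set;

import org.apache.kafka.common.Node;
import org.apache.kafka.common.TopicPartition;

// Hypothetical sketch of the Option 4 idea: never send a fetch request to a
// node that currently leads a buffered partition, so the request has no
// "holes" that the broker would treat as session evictions.
public class FetchNodeFilter {

    /** Minimal lookup abstraction standing in for the consumer's metadata. */
    public interface LeaderLookup {
        Optional<Node> leaderFor(TopicPartition partition);
    }

    /**
     * Returns true if a fetch may be sent to the given node, i.e. the node is
     * not the leader of any partition whose data is still buffered locally.
     */
    public static boolean canFetchFromNode(Node node,
                                           Set<TopicPartition> bufferedPartitions,
                                           LeaderLookup leaderLookup) {
        for (TopicPartition partition : bufferedPartitions) {
            Optional<Node> leader = leaderLookup.leaderFor(partition);
            if (leader.isPresent() && leader.get().id() == node.id()) {
                // Fetching from this node would omit the buffered partition,
                // which the broker reads as a request to forget it. Skip the
                // node until its buffered data has been returned to the user.
                return false;
            }
        }
        return true;
    }
}
```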
Testing

Eviction rate testing

Here are the results of our internal stress testing:

- `ClassicKafkaConsumer`: after the initial spike during test startup, the average rate settles down to ~0.14 evictions/second
- `AsyncKafkaConsumer` (w/o fix): after startup, the evictions still settle down, but they are about 100x higher than the `ClassicKafkaConsumer`, at ~1.48 evictions/second
- `AsyncKafkaConsumer` (w/ fix): the eviction rate is now closer to the `ClassicKafkaConsumer`, at ~0.22 evictions/second

EndToEndLatency testing

The bundled `EndToEndLatency` test runner was executed on a single machine using Docker. The `apache/kafka:latest` Docker image was used with either the `cluster/combined/plaintext/docker-compose.yml` or `single-node/plaintext/docker-compose.yml` Docker Compose configuration file, depending on the test. The Docker containers were recreated from scratch before each test. A single topic was created with 30 partitions and a replication factor of either 1 or 3, depending on the single- or multi-node setup.
For each of the test runs, these argument values were used:

- `acks`: 1

A configuration file containing the single configuration value `group.protocol=<$group_protocol>` was also provided to the test, where `$group_protocol` was either `CLASSIC` or `CONSUMER`.
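As a side note, the same switch can be set programmatically when constructing a consumer. A minimal sketch, assuming a recent Kafka clients release (`ConsumerConfig.GROUP_PROTOCOL_CONFIG` exists as of 3.7):

```java
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GroupProtocolExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "latency-test");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        // "consumer" selects the new AsyncKafkaConsumer-based protocol (KIP-848);
        // "classic" selects the original ClassicKafkaConsumer behavior.
        props.put(ConsumerConfig.GROUP_PROTOCOL_CONFIG, "consumer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // subscribe and poll as usual
        }
    }
}
```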
Test results

- Test 1: `CLASSIC` group protocol, cluster size: 3 nodes, replication factor: 3
- Test 2: `CONSUMER` group protocol, cluster size: 3 nodes, replication factor: 3
- Test 3: `CLASSIC` group protocol, cluster size: 1 node, replication factor: 1
- Test 4: `CONSUMER` group protocol, cluster size: 1 node, replication factor: 1

Each test compared `trunk` against this PR. (Per-test latency tables omitted.)
Conclusion

These tests did not reveal any significant differences between the current fetcher logic on `trunk` and the one proposed in this PR. Additional test runs using larger message counts and/or larger message sizes did not affect the results.